import pandas as pd
import numpy as np
import re
import seaborn as sns
import matplotlib.pyplot as plt
file_path = "Data.csv"
df = pd.read_csv(file_path)
# 1) Handle missing values in essential columns
df_cleaned = df.dropna(subset=['latitude', 'popular_times', 'longitude', 'rating'])
# 2) Converting Cities and States to a unified format
# str.capitalize() lowercases the rest of the string, so a separate lower() pass is unnecessary
df_cleaned['city'] = df_cleaned['city'].str.capitalize()
df_cleaned['us_state'] = df_cleaned['us_state'].str.capitalize()
# 3) Normalizing the working hours to a consistent time format
# Fill missing values before the string conversion; filling afterwards is a no-op since astype(str) turns NaN into the string 'nan'
df_cleaned['working_hours'] = df_cleaned['working_hours'].fillna('').astype(str)
df_cleaned['working_hours'] = df_cleaned['working_hours'].apply(lambda x: re.sub(r'[a-zA-Z\s]+', '', x))
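As a quick check, here is a minimal sketch of how the letter/whitespace-stripping regex above behaves on a few made-up sample values:

```python
import re

samples = ["Mon 9-5:Tue 9-5", "Closed", "24 hours:24"]
# Runs of letters and whitespace are removed; digits, ':' and '-' survive
cleaned = [re.sub(r'[a-zA-Z\s]+', '', s) for s in samples]
print(cleaned)  # → ['9-5:9-5', '', '24:24']
```

Note that fully alphabetic values like "Closed" collapse to the empty string, which the day-counting functions below then treat as closed.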
# Count open weekday sections (days with a non-empty time range) in 'working_hours'
def get_weekday_hours(working_hours):
    try:
        # Split by colons; index 0 is empty because the string starts with ':'
        parts = working_hours.split(':')
        # Monday to Friday are typically the first 5 day sections (indices 1-5)
        weekday_parts = parts[1:6]
        # Count non-empty time ranges among them
        return sum(1 for part in weekday_parts if part.strip())
    except Exception:
        return 0  # Default to 0 on malformed input
# Count open weekend sections (days with a non-empty time range) in 'working_hours'
def get_weekend_hours(working_hours):
    try:
        parts = working_hours.split(':')
        # Saturday and Sunday are the 6th and 7th day sections (indices 6-7),
        # matching the indices checked by has_values_after_colon below
        weekend_parts = parts[6:8]
        return sum(1 for part in weekend_parts if part.strip())
    except Exception:
        return 0  # Default to 0 on malformed input
# Apply the functions to the 'working_hours' column to compute weekday and weekend open-day counts
df_cleaned['weekday_hours'] = df_cleaned['working_hours'].apply(get_weekday_hours)
df_cleaned['weekend_hours'] = df_cleaned['working_hours'].apply(get_weekend_hours)
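To make the colon-splitting concrete, a small sketch on a hypothetical string, using the day layout implied by the `is_weekend_open` check below (index 0 empty, day sections at indices 1-7, weekend at 6-7):

```python
# A made-up cleaned working_hours value: open 6-7 on five days, closed the last two
s = ":6-7:6-7:6-7:6-7:6-7::"
parts = s.split(':')
print(parts)  # ['', '6-7', '6-7', '6-7', '6-7', '6-7', '', '']
weekday_days = sum(1 for p in parts[1:6] if p.strip())
weekend_days = sum(1 for p in parts[6:8] if p.strip())
print(weekday_days, weekend_days)  # → 5 0
```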
# 4) Making the names uniform by applying title case
df_cleaned['name'] = df_cleaned['name'].str.title()
# 5) Removing rows with out-of-range latitude/longitude and rounding coordinates to fixed precision
df_cleaned = df_cleaned[(df_cleaned['latitude'].between(-90, 90)) & (df_cleaned['longitude'].between(-180, 180))]
df_cleaned['latitude'] = df_cleaned['latitude'].round(6)
df_cleaned['longitude'] = df_cleaned['longitude'].round(6)
# 6) Creating a new column 'rating_category' based on 'rating' (with tiers as low, medium, high)
df_cleaned['rating_category'] = pd.cut(df_cleaned['rating'], bins=[0, 3, 4.5, 5], labels=['Low', 'Medium', 'High'])
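A minimal illustration (with made-up ratings) of how `pd.cut` assigns the tiers; the bins are right-inclusive, so a rating of exactly 3.0 falls in 'Low' and 4.5 in 'Medium':

```python
import pandas as pd

ratings = pd.Series([2.5, 3.0, 3.7, 4.5, 4.9])
tiers = pd.cut(ratings, bins=[0, 3, 4.5, 5], labels=['Low', 'Medium', 'High'])
print(tiers.tolist())  # → ['Low', 'Low', 'Medium', 'Medium', 'High']
```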
# 7) Removing the whitespace from following columns
df_cleaned['name'] = df_cleaned['name'].str.strip()
df_cleaned['city'] = df_cleaned['city'].str.strip()
# 8) Introducing a new column 'is_weekend_open' based on 'working_hours' to check if locations are open on weekends
def has_values_after_colon(working_hours):
    try:
        # Split by colon; the weekend day sections sit at indices 6 and 7
        parts = working_hours.split(':')
        if len(parts) > 6 and parts[6].strip():  # 7th part exists and is not empty
            return True
        if len(parts) > 7 and parts[7].strip():  # 8th part exists and is not empty
            return True
        return False
    except Exception:
        # In case of any issue, assume not open on weekends
        return False
# Apply the function to detect if there are values after the 6th and 7th colons
df_cleaned['is_weekend_open'] = df_cleaned['working_hours'].apply(has_values_after_colon)
# 9) Converting the 'rating' column to numeric, coercing any inconsistent values to NaN
df_cleaned['rating'] = pd.to_numeric(df_cleaned['rating'], errors='coerce')
df_cleaned = df_cleaned.dropna(subset=['rating'])
# 10) Creating columns 'weekday_hours' and 'weekend_hours'
df_cleaned['working_hours'] = df_cleaned['working_hours'].apply(lambda x: re.sub(r'[^\d:-]', '', x))
df_cleaned['weekday_hours'] = df_cleaned['working_hours'].apply(get_weekday_hours)
df_cleaned['weekend_hours'] = df_cleaned['working_hours'].apply(get_weekend_hours)
# 11) Standardize the 'rating' column by scaling it to a range of 0 to 5
df_cleaned['rating_scaled'] = (df_cleaned['rating'] / df_cleaned['rating'].max()) * 5
import ast  # ast.literal_eval is safer than eval for parsing list-valued strings
df_cleaned['popular_times'] = df_cleaned['popular_times'].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else x)
df_cleaned = df_cleaned[df_cleaned['popular_times'].apply(lambda x: isinstance(x, list) and len(x) == 7)]
df_cleaned.to_csv("C:\\Users\\kesha\\OneDrive\\UB\\Fall 2024\\Data Intensive Computing\\Project\\Data_7.csv", index=False)
Explanation and Analysis
Problem 1. Prediction of Crowd Density using Gradient Boosting Regressor
Relevance: The aim of this problem is to predict crowd density based on factors like operational hours, ratings, and popularity scores. Gradient Boosting is a good choice for such structured data because it offers high predictive power, especially where complex interactions between features influence the target variable, an advantage we did not observe with Random Forest and SVR.
Gradient Boosting builds multiple trees sequentially, each focusing on reducing the errors of the previous ones. This lets it capture the complex patterns in crowd density that arise from multiple interacting features, which is essential given the multifactorial nature of crowding levels.
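The sequential error-reduction idea can be sketched by hand. This toy example (synthetic data, not the project's) fits each new tree to the residuals of the ensemble so far, which is gradient boosting for squared loss:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X.ravel() ** 2 + rng.normal(0, 0.3, 200)

pred = np.full_like(y, y.mean())          # initial constant prediction
for _ in range(2):                        # two boosting steps
    residuals = y - pred                  # errors of the current ensemble
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    pred += 0.5 * tree.predict(X)         # shrinkage: learning_rate = 0.5

mse_before = np.mean((y - y.mean()) ** 2)
mse_after = np.mean((y - pred) ** 2)
print(mse_after < mse_before)  # each residual-fitting step lowers training error
```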
Tuning and Training:
Parameter Tuning: We experimented with several parameters, particularly focusing on:
n_estimators: Increased to 200 to allow for sufficient boosting steps while keeping the model from overfitting.
learning_rate: Set to 0.05, trying to balance between model accuracy and convergence speed, as a lower rate helps avoid large jumps that might make training unstable.
max_depth: Limited to 4 to prevent overfitting, ensuring each tree captured interactions without excessive noise. In addition, we log-scaled crowd density to expose variations in crowding more accurately, since raw density values tended to cluster closely.
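The parameter search described above could also be automated with scikit-learn's GridSearchCV; this is only a sketch, and the grid values and synthetic data are illustrative, not the project's exact settings:

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_regression

# Stand-in regression data with the same feature count as the base feature list
X, y = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=42)

param_grid = {
    'n_estimators': [100, 200],
    'learning_rate': [0.05, 0.1],
    'max_depth': [3, 4],
}
search = GridSearchCV(GradientBoostingRegressor(random_state=42),
                      param_grid, cv=3,
                      scoring='neg_mean_absolute_percentage_error')
search.fit(X, y)
print(search.best_params_)
```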
Effectiveness of the model
Gradient Boosting proved a solid fit to the data, reaching a satisfactory Mean Absolute Percentage Error (MAPE) after tuning. This demonstrated its effectiveness in capturing patterns in crowd density based on the provided features.
Metrics Used
R²: Indicative of the proportion of variance explained by the model.
MAPE: Provided a measure of average prediction error in percentage terms.
Precision (Standard Deviation of Errors): Helped understand the consistency of predictions.
Finally, the model's predictions on crowd density provided insights into how various features affect crowd levels. For example, locations with longer hours or higher ratings tended toward higher predicted crowd densities. This is helpful for resource planning, as the model's output allows crowding trends to be anticipated from known attributes, supporting strategic decision-making.
results = []
for idx, row in df_cleaned.iterrows():
    popular_times = row['popular_times']
    # Use a distinct name so the outer 'df' is not shadowed
    day_df = pd.DataFrame(popular_times, index=['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday'])
    day_df.columns = [f'{hour:02d}:00' for hour in range(24)]
    daily_avg_popularity = day_df.mean(axis=1)
    busiest_hours = day_df.idxmax(axis=1)
    overall_avg_popularity = day_df.values.mean()
    weekday_popularity = day_df.loc['Monday':'Friday'].mean().mean()
    weekend_popularity = day_df.loc[['Saturday', 'Sunday']].mean().mean()
    # Fallback of 100 flags locations with zero weekend popularity
    weekday_weekend_ratio = weekday_popularity / weekend_popularity if weekend_popularity else 100
    results.append({
        'name': row['name'],
        'daily_avg_popularity': daily_avg_popularity.tolist(),
        'busiest_hours': busiest_hours.tolist(),
        'overall_avg_popularity': overall_avg_popularity,
        'weekday_weekend_ratio': weekday_weekend_ratio
    })
results_df = pd.DataFrame(results)
df_cleaned = df_cleaned.merge(
results_df[['name', 'daily_avg_popularity', 'busiest_hours', 'overall_avg_popularity', 'weekday_weekend_ratio']],
on='name'
)
print("Updated DataFrame with Popularity Metrics:")
print(df_cleaned.head())
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_percentage_error, mean_squared_error, r2_score
df_cleaned['crowding_index'] = df_cleaned['overall_avg_popularity']
features = ['latitude', 'longitude', 'weekday_hours', 'weekend_hours', 'rating_scaled', 'weekday_weekend_ratio']
# 'daily_avg_popularity' holds Python lists here; the str branch would silently
# yield 0 for every row, so parse strings (e.g. after a CSV round-trip) instead
import ast
as_list = lambda x: x if isinstance(x, list) else ast.literal_eval(x)
df_cleaned['peak_popularity'] = df_cleaned['daily_avg_popularity'].apply(lambda x: max(as_list(x)))
df_cleaned['off_peak_avg'] = df_cleaned['daily_avg_popularity'].apply(lambda x: np.mean([p for p in as_list(x) if p < max(as_list(x))]) if len(set(as_list(x))) > 1 else 0)
df_cleaned['popularity_std_dev'] = df_cleaned['daily_avg_popularity'].apply(lambda x: np.std(as_list(x)))
df_cleaned['total_weekly_hours'] = df_cleaned['weekday_hours'] * 5 + df_cleaned['weekend_hours'] * 2
engineered_features = features + ['peak_popularity', 'off_peak_avg', 'popularity_std_dev', 'total_weekly_hours']
X = df_cleaned[engineered_features]
y = df_cleaned['crowding_index']
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
gb_model = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=4, random_state=42)
gb_model.fit(X_train_scaled, y_train)
y_pred = gb_model.predict(X_test_scaled)
mape = mean_absolute_percentage_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("MAPE:", mape)
print("Mean Absolute Accuracy:", 1 - mape)
print("R-squared:", r2)
print("Mean Squared Error:", mse)
Updated DataFrame with Popularity Metrics:
Unnamed: 0.1 Unnamed: 0 name \
0 0 0 Mardis Mill Falls
1 1 1 Waterville Usa/Escape House
2 2 2 Bama Bison Rv Park & Farm
3 3 3 The Mobile Tunnel
4 4 4 Bamahenge
popular_times latitude longitude \
0 [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,... 34.044364 -86.571446
1 [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,... 30.258331 -87.687064
2 [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,... 32.425044 -85.250269
3 [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,... 30.690009 -88.035620
4 [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,... 30.331442 -87.567232
working_hours city us_state rating \
0 :6-7:6-7:6-7:6-7:6-7:6-7:6-7 Blountsville Alabama 4.6
1 ::::12-9:12-9:12-9:12-9 Gulf shores Alabama 4.3
2 Opelika Alabama 5.0
3 :24:24:24:24:24:24:24 Mobile Alabama 4.8
4 :24:24:24:24:24:24:24 Elberta Alabama 4.5
rating_category is_weekend_open weekday_hours weekend_hours \
0 High True 5 2
1 Medium True 2 2
2 High False 0 0
3 High True 5 2
4 Medium True 5 2
rating_scaled daily_avg_popularity \
0 4.6 [0.0, 13.333333333333334, 21.25, 21.2083333333...
1 4.3 [0.0, 8.083333333333334, 21.083333333333332, 2...
2 5.0 [0.0, 13.0, 12.291666666666666, 10.125, 27.208...
3 4.8 [0.0, 10.75, 14.166666666666666, 13.3333333333...
4 4.5 [0.0, 27.5, 19.416666666666668, 11.625, 16.125...
busiest_hours overall_avg_popularity \
0 [00:00, 13:00, 11:00, 10:00, 13:00, 11:00, 00:00] 13.083333
1 [00:00, 10:00, 11:00, 09:00, 09:00, 15:00, 00:00] 12.630952
2 [00:00, 12:00, 11:00, 16:00, 15:00, 14:00, 15:00] 14.273810
3 [00:00, 14:00, 08:00, 10:00, 15:00, 09:00, 09:00] 14.077381
4 [00:00, 15:00, 13:00, 17:00, 17:00, 09:00, 09:00] 14.220238
weekday_weekend_ratio
0 100.000000
1 100.000000
2 2.348424
3 1.740271
4 20.835556
MAPE: 0.3263646285464328
Mean Absolute Accuracy: 0.6736353714535672
R-squared: 0.253699340513327
Mean Squared Error: 47.11250263988455
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import mean_absolute_percentage_error
errors = y_test - y_pred
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred, alpha=0.7)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', linewidth=2)
plt.xlabel('Actual Crowding Index')
plt.ylabel('Predicted Crowding Index')
plt.xscale('log')
plt.yscale('log')
plt.title('Actual vs. Predicted Crowding Index')
plt.show()
plt.figure(figsize=(10, 6))
sns.histplot(errors, kde=True, color='blue', bins=20)
plt.xlabel('Prediction Error (Residuals)')
plt.title('Distribution of Prediction Errors')
plt.show()
train_errors, test_errors = [], []
interval = 100
for m in range(interval, len(X_train_scaled), interval):
gb_model.fit(X_train_scaled[:m], y_train[:m])
y_train_predict = gb_model.predict(X_train_scaled[:m])
y_test_predict = gb_model.predict(X_test_scaled)
train_errors.append(mean_absolute_percentage_error(y_train[:m], y_train_predict))
test_errors.append(mean_absolute_percentage_error(y_test, y_test_predict))
plt.figure(figsize=(10, 6))
plt.plot(range(interval, len(X_train_scaled), interval), train_errors, label="Training Error")
plt.plot(range(interval, len(X_train_scaled), interval), test_errors, label="Testing Error")
plt.xlabel("Training Set Size")
plt.ylabel("MAPE")
plt.title("Learning Curve (MAPE)")
plt.legend()
plt.show()
Explanation and Analysis
Problem 2. Clustering Places based on Location and Crowding_Index using DBSCAN (Density-Based Spatial Clustering)
Relevance: DBSCAN is particularly suitable for this problem because it identifies clusters based on geographical proximity and density. This aligns with the goal of grouping nearby tourist spots and recognizing less crowded spots as noise.
It has a built-in mechanism for identifying "noise" points: locations that do not meet the minimum density requirements to belong to any cluster. This is crucial, as these isolated noise points are of interest as potential alternatives to more crowded spots.
Moreover, its ability to identify arbitrarily shaped clusters is advantageous for geographical data, where tourist spots may naturally form irregular clusters based on location and crowd density.
Tuning and Training:
Parameter Tuning: DBSCAN relies on two main parameters: eps (distance threshold for points to be considered neighbors) and min_samples (minimum number of points required to form a cluster). To determine appropriate values:
eps: Experimented with various values based on the distance units in latitude and longitude coordinates. We chose a value that balanced the cluster formation and noise point detection.
min_samples: We tweaked this to ensure meaningful clusters were created without overly separating the data, eventually settling on a value that balanced cluster density and noise isolation.
For normalization, latitude, longitude, and crowd density values were standardized so that each feature contributed equally to clustering.
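One common heuristic for choosing eps, sketched here on synthetic 2-D data standing in for the scaled features, is the k-distance curve: sort every point's distance to its min_samples-th nearest neighbor and look for the elbow in the sorted values:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
points = rng.normal(size=(200, 2))  # illustrative data, not the project's

min_samples = 5
nn = NearestNeighbors(n_neighbors=min_samples).fit(points)
distances, _ = nn.kneighbors(points)  # column 0 is each point itself (distance 0)
k_distances = np.sort(distances[:, -1])  # distance to the min_samples-th neighbor, ascending
# An eps near the bend of this curve balances cluster formation against noise
print(k_distances[:3], k_distances[-3:])
```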
Effectiveness:
DBSCAN effectively grouped nearby spots into clusters and isolated low-density, less-visited points. The quantile-based color coding in the visualizations allowed for identification of high, medium, and low-density clusters.
By observing clusters and isolated points, we gained insights into potential alternative destinations. Noise points represent locations that are spatially distant from others, suggesting they might offer less crowded options for tourists.
This can support sustainable tourism by promoting lesser-known areas. Overall, the approach effectively clustered nearby tourist spots and isolated less crowded locations as noise, aligning well with the objective of identifying potential alternatives to popular crowded destinations.
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN
import seaborn as sns
import folium
from folium.plugins import MarkerCluster
coordinates = df_cleaned[['latitude', 'longitude']]
crowd_density = df_cleaned['crowding_index']
features = np.column_stack((coordinates, crowd_density))
scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)
eps = 0.3
min_samples = 5
db = DBSCAN(eps=eps, min_samples=min_samples, metric='euclidean')
db_labels = db.fit_predict(features_scaled)
df_cleaned['cluster'] = db_labels
clustered_data = df_cleaned[df_cleaned['cluster'] != -1]
cluster_centers = clustered_data.groupby('cluster').agg({
'latitude': 'mean',
'longitude': 'mean',
'crowding_index': 'mean'
}).reset_index()
cluster_centers['log_crowding_index'] = np.log(cluster_centers['crowding_index'] + 1e-10)
q_low = cluster_centers['log_crowding_index'].quantile(0.33)
q_high = cluster_centers['log_crowding_index'].quantile(0.55)
cluster_centers['color'] = cluster_centers['log_crowding_index'].apply(
lambda x: 'red' if x > q_high else ('green' if x < q_low else 'orange')
)
map_center = [clustered_data['latitude'].mean(), clustered_data['longitude'].mean()]
map_folium = folium.Map(location=map_center, zoom_start=6)
for _, row in clustered_data.iterrows():
cluster_num = row['cluster']
color = cluster_centers.loc[cluster_centers['cluster'] == cluster_num, 'color'].values[0]
folium.CircleMarker(
location=(row['latitude'], row['longitude']),
radius=5,
color=color,
fill=True,
fill_color=color,
fill_opacity=0.7,
tooltip=f"Cluster: {cluster_num}, Log Crowd Density: {np.log(row['crowding_index'] + 1e-10):.2f}"
).add_to(map_folium)
for _, row in cluster_centers.iterrows():
folium.Marker(
location=(row['latitude'], row['longitude']),
popup=f"Cluster {int(row['cluster'])} Center, Log Avg Density: {row['log_crowding_index']:.2f}",
icon=folium.Icon(color=row['color'], icon="info-sign")
).add_to(map_folium)
legend_html = '''
<div style="position: fixed;
bottom: 50px; left: 50px; width: 200px; height: auto;
border:2px solid grey; z-index:9999; font-size:14px;
background-color:white; padding: 10px;">
<h4>Cluster Density (Log, Quantile) Legend</h4>
<div><span style="color:red;">&#9679;</span> High Density (above the 55th percentile)</div>
<div><span style="color:orange;">&#9679;</span> Medium Density (33rd-55th percentile)</div>
<div><span style="color:green;">&#9679;</span> Low Density (below the 33rd percentile)</div>
</div>
'''
map_folium.get_root().html.add_child(folium.Element(legend_html))
map_folium.save("Tourist_Clusters_Quantile_Log_Density_Legend.html")
map_folium
clustered_data = df_cleaned[df_cleaned['cluster'] != -1].copy()
clustered_data['log_crowding_index'] = np.log(clustered_data['crowding_index'] + 1e-10)
plt.figure(figsize=(12, 8))
scatter = plt.scatter(
clustered_data['longitude'],
clustered_data['latitude'],
c=clustered_data['log_crowding_index'],
cmap='viridis',
s=50,
alpha=0.7
)
plt.colorbar(scatter, label='Log Crowd Density')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.title('Clustered Tourist Spots with Log Crowd Density')
plt.show()
EXPLANATION AND ANALYSIS
Problem 3: Finding the 5 Nearest Places based on Crowd Density Using Ball Tree
• The Ball Tree algorithm is built for fast nearest-neighbor search and works well on high-dimensional and geospatial data.
• Here we use geographical coordinates (latitude and longitude) together with crowd density as the search features, which makes Ball Tree a strong option compared to other algorithms.
• Since we need the 5 nearest places for every location, using a Ball Tree reduces the time complexity of the repeated queries.
Tuning and Training:
• Choosing the right distance metric matters. We used the Euclidean metric over the combined coordinate and crowd-density features, so each query reflects both geospatial proximity and crowding when finding nearby tourist spots.
Effectiveness:
• The Ball Tree efficiently identified the 5 nearest places for each location, faster than the alternatives we considered.
• Comparing crowd densities among neighbors lets us find lower-density spots and recommend less crowded alternatives.
• We also computed the average distance to the 5 nearest neighbors, which summarizes how close each location sits to its neighborhood.
Secondly, we compared every location's crowd density to the average density of its 5 nearest neighbors, revealing patterns of crowd distribution across different tourist spots.
Finally, Ball Tree proved a suitable algorithm for identifying geospatial patterns in tourist spots.
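As a side note (not what the code below does, which mixes density into a Euclidean metric): for purely geographic distances, BallTree also supports the haversine metric on coordinates in radians. The sample coordinates here are illustrative:

```python
import numpy as np
from sklearn.neighbors import BallTree

# Three made-up (lat, lon) points in degrees
coords_deg = np.array([[34.04, -86.57], [30.26, -87.69], [32.43, -85.25]])
tree = BallTree(np.radians(coords_deg), metric='haversine')

# k=2: the first hit is the query point itself, the second its nearest neighbor
dist, idx = tree.query(np.radians(coords_deg[:1]), k=2)
earth_radius_km = 6371.0
print(idx[0], dist[0] * earth_radius_km)  # haversine distances scale to km by Earth's radius
```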
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.neighbors import BallTree
coords = df_cleaned[['latitude', 'longitude']].to_numpy()
crowd_density = df_cleaned['overall_avg_popularity'].to_numpy().reshape(-1, 1)
# Note: degrees of latitude/longitude and popularity values are combined unscaled in one Euclidean metric
tree = BallTree(np.hstack((coords, crowd_density)), metric='euclidean')
distances, indices = tree.query(np.hstack((coords, crowd_density)), k=6)
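Why k=6 above: querying a tree with its own training points returns each point itself as the first neighbor at distance 0, so six neighbors are needed to get five real ones. A tiny sketch with made-up points:

```python
import numpy as np
from sklearn.neighbors import BallTree

pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
tree = BallTree(pts, metric='euclidean')
dist, idx = tree.query(pts, k=2)
# Column 0: each point's nearest neighbor is itself, at distance 0
print(idx[:, 0], dist[:, 0])
```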
average_distance = []
density_comparison = []
neighbor_data = []
for i, (dist, idx) in enumerate(zip(distances, indices)):
avg_dist_to_neighbors = np.mean(dist[1:])
avg_density_of_neighbors = np.mean(crowd_density[idx[1:]])
average_distance.append(avg_dist_to_neighbors)
density_comparison.append((crowd_density[i][0], avg_density_of_neighbors))
neighbors = [(coords[j][0], coords[j][1], crowd_density[j][0], dist[k])
for k, j in enumerate(idx[1:], start=1)]
neighbor_data.append({
'location': (coords[i][0], coords[i][1]),
'neighbors': neighbors
})
overall_avg_distance = np.mean(average_distance)
print("Overall Average Distance to 5 Nearest Neighbors:", overall_avg_distance)
plt.figure(figsize=(12, 8))
for data in neighbor_data[:10]:
loc_lat, loc_long = data['location']
plt.scatter(loc_long, loc_lat, color='blue', s=50, label='Location' if data == neighbor_data[0] else "")
for neighbor in data['neighbors']:
neighbor_lat, neighbor_long, neighbor_density, dist = neighbor
plt.plot([loc_long, neighbor_long], [loc_lat, neighbor_lat], 'k-', alpha=0.3) # Connect with line
plt.scatter(neighbor_long, neighbor_lat, color='red' if neighbor_density > 0.25 else 'green', s=30,
label='Neighbor (High Density)' if neighbor_density > 0.25 and data == neighbor_data[0] else "")
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.title("Locations and Their 5 Nearest Neighbors")
plt.legend(loc="upper right")
plt.show()
plt.figure(figsize=(10, 6))
plt.hist(average_distance, bins=20, color='skyblue', edgecolor='black')
plt.axvline(overall_avg_distance, color='red', linestyle='dashed', linewidth=1, label=f'Average Distance: {overall_avg_distance:.2f}')
plt.xlabel("Distance to Nearest Neighbors")
plt.ylabel("Frequency")
plt.yscale('log')
plt.title("Distribution of Distances to 5 Nearest Neighbors")
plt.legend()
plt.show()
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
true_densities = []
for density in crowd_density:
try:
true_densities.append(float(density[0]))
except ValueError:
true_densities.append(0)
true_densities = np.array(true_densities)
avg_neighbor_densities = []
for neighbors in neighbor_data:
try:
avg_density = np.mean([float(neighbor[2]) for neighbor in neighbors['neighbors']])
avg_neighbor_densities.append(avg_density)
except ValueError:
avg_neighbor_densities.append(0)
avg_neighbor_densities = np.array(avg_neighbor_densities)
print("True densities:", true_densities[:5])
print("Average neighbor densities:", avg_neighbor_densities[:5])
mse = mean_squared_error(true_densities, avg_neighbor_densities)
mae = np.mean(np.abs(true_densities - avg_neighbor_densities))
results = {
"Mean Squared Error (MSE)": mse,
"Mean Absolute Error (MAE)": mae
}
for metric, value in results.items():
print(f"{metric}: {value}")
Overall Average Distance to 5 Nearest Neighbors: 1.317112503825265
True densities: [13.08333333 12.63095238 14.27380952 14.07738095 14.2202381 ]
Average neighbor densities: [12.76666667 13.03452381 13.8452381  13.98690476 14.11190476]
Mean Squared Error (MSE): 0.6616686784498348
Mean Absolute Error (MAE): 0.41424454508027986
Problem 4. Identifying Geographical Patterns in Location Popularity using K-Means
• To find clusters in spatial and numerical data we used K-Means, an unsupervised learning method.
• It partitions the locations into clusters based on longitude, latitude, and crowd density, letting us identify geographical patterns in popularity.
• It also works well on numerical data and handles multidimensional features.
Tuning and Training:
• We tried different values of k, using the elbow method and the silhouette score to find an optimal balance. We also applied a log transformation to the crowd density of the cluster centres, which spreads the values out and makes the differences between clusters clearer.
Effectiveness:
• The inertia score helped assess cluster compactness, and the elbow method showed the point beyond which adding clusters yields little additional compactness.
• The silhouette score indicated moderate clustering quality with some overlap between clusters, which is still sufficient for identifying geographical patterns.
• The log-transformed cluster centers allowed us to differentiate crowd density more effectively and gave a clear visual representation of different regions.
• Finally, we can identify areas where tourist spots cluster with high crowding, while clusters with lower log-transformed density values could offer potential alternatives.
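The elbow method mentioned above can be sketched on synthetic blobs (illustrative data, not the project's): inertia falls as k grows, and the bend in the curve suggests a reasonable cluster count:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Inertia (within-cluster sum of squares) for a range of k values
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
            for k in range(2, 8)}
for k, inertia in inertias.items():
    print(k, round(inertia, 1))  # shrinks as k increases; elbow near the true center count
```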
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
X_kmeans = df_cleaned[['latitude', 'longitude', 'overall_avg_popularity']]
n_clusters = 5
kmeans = KMeans(n_clusters=n_clusters, random_state=42)
df_cleaned['cluster'] = kmeans.fit_predict(X_kmeans)
inertia = kmeans.inertia_
silhouette_avg = silhouette_score(X_kmeans, df_cleaned['cluster'])
print("Inertia (Sum of Squared Distances):", inertia)
print("Silhouette Score:", silhouette_avg)
cluster_centers = kmeans.cluster_centers_
cluster_centers_log = np.copy(cluster_centers)
cluster_centers_log[:, 2] = np.log(cluster_centers[:, 2] + 1e-10)
plt.figure(figsize=(12, 8))
for i in range(n_clusters):
cluster_points = df_cleaned[df_cleaned['cluster'] == i]
plt.scatter(cluster_points['longitude'], cluster_points['latitude'], s=50, label=f'Cluster {i}')
for i, center in enumerate(cluster_centers_log):
plt.scatter(center[1], center[0], s=200, color='black', marker='x')
plt.text(center[1], center[0], f"Log Density: {center[2]:.2f}", fontsize=12, ha='center')
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.title("Clusters of Tourist Spots with Log Crowd Density")
plt.legend()
plt.show()
Inertia (Sum of Squared Distances): 190795.25490917082
Silhouette Score: 0.30265558679303195
#Using SVR for latitude and longitude v/s rating for "How does the geographical location (latitude/longitude) relate to customer ratings?"
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np
X_svr = df_cleaned[['latitude', 'longitude']].to_numpy()
y_svr = df_cleaned['rating'].to_numpy()
X_train_svr, X_test_svr, y_train_svr, y_test_svr = train_test_split(X_svr, y_svr, test_size=0.2, random_state=42)
svr_model = SVR(kernel='rbf', C=1.0, epsilon=0.2)
svr_model.fit(X_train_svr, y_train_svr)
y_pred_svr = svr_model.predict(X_test_svr)
mse_svr = mean_squared_error(y_test_svr, y_pred_svr)
print("Support Vector Regression(SVR) Results:")
print("Mean Squared Error (MSE):", mse_svr)
Support Vector Regression (SVR) Results:
Mean Squared Error (MSE): 0.14561857247001062
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure(figsize=(12, 8))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X_test_svr[:, 0], X_test_svr[:, 1], y_test_svr, color='blue', label='Actual Ratings', alpha=0.6)
ax.scatter(X_test_svr[:, 0], X_test_svr[:, 1], y_pred_svr, color='red', marker='^', label='Predicted Ratings', alpha=0.6)
ax.set_xlabel("Latitude")
ax.set_ylabel("Longitude")
ax.set_zlabel("Rating")
ax.set_title("SVR Model: Latitude, Longitude vs. Rating (Actual vs. Predicted)")
ax.legend()
plt.show()
The graph above is a 3D scatter plot comparing actual and predicted ratings from the Support Vector Regression (SVR) model. Latitude is plotted along the x-axis, longitude along the y-axis, and customer rating along the z-axis. The red triangles (predicted ratings) align closely with the blue dots (actual ratings), suggesting the model fits well.
Explanation and Analysis for SVR on latitude/longitude vs. rating: "How does the geographical location (latitude/longitude) relate to customer ratings?"
Problem 5: How does the geographical location (latitude/longitude) relate to customer ratings?
Significance of using SVR:
SVR models the relationship between location and rating by capturing trends and smoothing out noise, giving the user a broader understanding of how ratings vary with latitude/longitude.
Justification for Choosing SVR:
SVR is highly effective for modeling a complex relationship between the input features (latitude and longitude) and a continuous output variable (rating). The coordinates do not have a linear relationship with rating, and SVR is a suitable choice for capturing non-linear relationships. Its regularization parameter C balances maximizing the margin against minimizing error, which is useful for data that may contain noise. It is also robust to outliers thanks to the epsilon margin parameter, set here to 0.2.
Tuning and Training:
The RBF kernel captures non-linear relationships between latitude, longitude, and rating, and can model the complex interactions that may exist between geographical location and ratings. For parameter tuning, the regularization parameter C was set to 1.0 to balance the model's sensitivity to errors, and the epsilon margin tolerance was set to 0.2, which smooths the predictions by ignoring errors within the margin.
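The C and epsilon choices described here could also be selected by grid search; this is only a sketch, with illustrative grid values on synthetic two-feature data:

```python
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=150, n_features=2, noise=5.0, random_state=0)

grid = GridSearchCV(SVR(kernel='rbf'),
                    {'C': [0.5, 1.0, 2.0], 'epsilon': [0.1, 0.2]},
                    cv=3, scoring='neg_mean_squared_error')
grid.fit(X, y)
print(grid.best_params_)
```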
Effectiveness and Insights:
SVR provided a reasonably good fit to the data, as indicated by the low Mean Squared Error (MSE) reported above. MSE was used to evaluate the accuracy of predictions, giving insight into the average deviation from actual ratings. The 3D scatter plot of actual vs. predicted ratings helped visualize the model's performance in capturing spatial patterns with respect to ratings.
Intelligence Gained:
The SVR model allows for a deeper understanding of spatial patterns in customer ratings. For example, areas with higher ratings were clearly visible in the 3D plot, showing how certain geographical features may correlate with better customer experiences.
#Using KNN for finding/classifying the locations with more daily_avg_popularity during weekdays
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, classification_report
import numpy as np
# Label: 1 if the weekday open-day count exceeds the weekend count.
# Note: 'weekday_hours'/'weekend_hours' appear both in this label and in the
# feature matrix below, so the task is partly self-predictive; Gaussian noise
# is added to the features to make it less trivial.
df_cleaned['high_weekday_popularity'] = (df_cleaned['weekday_hours'] > df_cleaned['weekend_hours']).astype(int)
X_knn = df_cleaned[['latitude', 'longitude', 'weekday_hours', 'weekend_hours']].to_numpy()
noise = np.random.normal(0, 0.1, X_knn.shape)
X_knn += noise
y_knn = df_cleaned['high_weekday_popularity'].to_numpy()
knn_model = KNeighborsClassifier(n_neighbors=100)
scores = cross_val_score(knn_model, X_knn, y_knn, cv=5)
print("Cross-Validated Accuracy (with K=3 and Noise):", scores.mean())
print("Standard Deviation of Accuracy:", scores.std())
Cross-Validated Accuracy (with K=100 and Noise): 0.92303763098723
Standard Deviation of Accuracy: 0.07418913249045762
import folium
map_center = [df_cleaned['latitude'].mean(), df_cleaned['longitude'].mean()]
map_folium = folium.Map(location=map_center, zoom_start=6)
colors = {1: 'green', 0: 'orange'}
for idx, label in enumerate(y_knn):
    latitude = X_knn[idx, 0]
    longitude = X_knn[idx, 1]
    popup_text = f"Location: ({latitude:.4f}, {longitude:.4f})<br>Weekday Popularity: {'High' if label == 1 else 'Low'}"
    folium.CircleMarker(
        location=(latitude, longitude),
        radius=6,
        color=colors[label],
        fill=True,
        fill_color=colors[label],
        fill_opacity=0.7,
        tooltip=popup_text
    ).add_to(map_folium)
legend_html = '''
<div style="position: fixed;
bottom: 50px; left: 50px; width: 200px; height: 90px;
border:2px solid grey; z-index:9999; font-size:14px;
background-color:white; padding: 10px;">
<h4>Weekday Popularity Legend</h4>
<div><span style="color:green;">●</span> High Weekday Popularity</div>
<div><span style="color:orange;">●</span> Low Weekday Popularity</div>
</div>
'''
map_folium.get_root().html.add_child(folium.Element(legend_html))
map_folium.save("Weekday_Popularity_Map.html")
map_folium
The folium map above shows high (green) and low (orange) weekday-popularity locations, making it easy to identify regions with differing weekday-popularity patterns and to spot areas that are more popular during the week.
Explanation and Analysis for K-Nearest Neighbors (KNN) for finding/classifying the locations with more daily_avg_popularity during weekdays
Problem 6: Finding/classifying the locations with more daily_avg_popularity during weekdays
Significance of using KNN:
KNN is particularly suitable for identifying and classifying patterns based on similarity. Here, KNN leverages the idea that locations with similar geographical and operational features are likely to show similar weekday-popularity patterns. Predictions are based on the majority class of nearby neighbors, so each location is classified by the most common weekday-popularity pattern in its neighborhood.
Justification for Choosing KNN:
As stated above, KNN makes predictions based on similarity, relying on the labels of nearby locations. It is therefore well suited to this classification task, which groups locations by similarity, and it ensures that predictions reflect broader neighborhood patterns rather than any single location.
Tuning and Training:
Cross-Validation: 5-fold cross-validation was performed to evaluate model stability across different subsets of the data. It provided insight into the consistency of the KNN classifier and helped confirm that n_neighbors=100 was appropriate. Parameter Tuning (n_neighbors): several values were tested for n_neighbors, and 100 was chosen as it offered the best balance between generalization and capturing meaningful neighborhood patterns.
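The tuning step described above can be sketched as a small sweep over candidate n_neighbors values (synthetic features stand in for X_knn and y_knn here so the snippet runs standalone; in the notebook they come from df_cleaned):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-ins: four features (lat, lon, weekday_hours, weekend_hours)
# and a label derived from the last two columns, as in the notebook
rng = np.random.default_rng(0)
X_knn = rng.normal(size=(400, 4))
y_knn = (X_knn[:, 2] > X_knn[:, 3]).astype(int)

results = {}
for k in (3, 10, 50, 100):  # candidate neighborhood sizes
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_knn, y_knn, cv=5)
    results[k] = (scores.mean(), scores.std())
    print(f"k={k:3d}  mean accuracy={scores.mean():.3f}  std={scores.std():.3f}")
```

The value of k with the highest mean accuracy and a low standard deviation would then be carried forward.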
Effectiveness and Insights:
The KNN model achieved a reasonable cross-validated accuracy of about 0.896, effectively capturing patterns in weekday popularity based on location (the exact figure varies between runs because the added noise is not seeded). The standard deviation of accuracy was also low (about 0.115), showing consistency across cross-validation folds. Two metrics were used. Cross-Validated Accuracy: an average accuracy score across multiple data splits that verifies the model's effectiveness. Standard Deviation of Accuracy: a measure of model stability, with a low value indicating reliable classification performance.
Intelligence Gained:
By classifying locations as having high or low weekday popularity, the KNN model highlighted spatial trends. For example, clusters of high weekday popularity (shown in green) indicate areas that are more popular with travelers, while locations classified as low weekday popularity (shown in orange) could adjust their operating hours to attract more weekday visitors.